Historical global ocean wave data simulated with CMIP6 anthropogenic and natural forcings

This dataset presents historical ocean wave climate during 1960–2020, simulated using the numerical model WAVEWATCH III (WW3) forced by Coupled Model Intercomparison Project phase 6 (CMIP6) simulations corresponding to natural-only (NAT), greenhouse gas-only (GHG), aerosol-only (AER) forcings, combined forcing (natural and anthropogenic; ALL), and pre-industrial control conditions. Surface wind at 3-hourly temporal resolution, and sea-ice area fraction at monthly frequency, from a CMIP6 model - MRI-ESM2.0 are used to force WW3 over the global ocean. Model calibration and validation of the significant wave height are carried out using inter-calibrated multi-mission altimeter data produced by the European Space Agency Climate Change Initiative, with additional corroboration using ERA-5 reanalysis. The simulated dataset is assessed for its skill to represent mean state, extremes, trends, seasonal cycle, time consistency, and spatial distribution over time. Numerically simulated wave parameters for different individual external forcing scenario is not available yet. This study produces a novel database particularly useful for detection and attribution analysis to quantify the relative contributions of natural and anthropogenic forcings to historical changes.


Background & Summary
The relevance of surface ocean wind waves is manifold, from environmental to socioeconomic impacts on human society. A wide range of applications of wave information already stated by many previous literature include ship routing 1 , coastal and offshore engineering activities 2 , assessment of coastal hazards 3 , wave energy extraction planning 4 , estimation of ocean-atmosphere exchange 5 , among others. The importance of dynamic coupling of wave processes into coupled ocean-atmosphere global climate models has been emphasized since long back as surface wind-waves are capable of directly affecting climate system through continuous ocean-atmosphere exchange 5 .
Despite being a key component of the climate system 6 , ocean waves data are not available from the state-of-art global climate models (GCM). Using outputs of CMIP5/CMIP6 GCMs to force wave model, several ensembles of global wave climate projections are available for historical as well as for future periods, especially within the framework of the Coordinated Ocean Wave Climate Project (COWCLIP) 7 . In order to understand wave climate response to growing anthropogenic interference, we need ocean wave information for different forcing scenario. Anthropogenic greenhouse gases and aerosols are major external forcings to global climate change 8,9 . CMIP6 multi-model simulations 10 provide scenarios for various external forcings under the Detection and Attribution Model Intercomparison Project (DAMIP) 11 . The goal of DAMIP is to attribute historically observed climate change to anthropogenic and natural causes, to derive observationally constrained projections of future climate change, and to improve understanding of the mechanisms by which particular forcings affect climate. Following the approach of Tier 1 experiments of DAMIP, we aim to create novel wave database corresponding to CMIP6 historical, hist-nat (natural-only), hist-GHG (greenhouse-only) and hist-aer (aerosol-only) experiments (described in Table 1) separately. Although few recent wave climate projections [12][13][14] are available using CMIP6 forcings, these include all forcings together. The DECK (Diagnostic, Evaluation and Characterization of Klima) experiment 10 within CMIP6 framework provides pre-industrial control simulation, which is representative of the period prior to the onset of large-scale industrialization, with 1850 being the reference year. The assumptions of the control simulation is that there are no changes in both naturally occurring forcings (e.g. volcanoes, orbital or solar changes) and human-induced forcings. Control simulation can thus be used to study the unforced internal variability of the climate system 11 . In addition to CMIP6 historical simulations, we aim to simulate wave conditions corresponding to pre-industrial control simulation using DECK results which will be useful to estimate the range of internal climate variability.
Here, we provide 6-hourly time-series of wave parameters (e.g. significant wave height, mean wave period, mean wave direction) for 1960-2020 for different forcing scenario. This extensive digital wave database can be used by various scientific community and policymakers. A direct application is detection and attribution (D&A) studies of historical wave climate change to anthropogenic and natural causes, specifically relative contribution of anthrogopenic greenhouse and anthropogenic aerosols. This kind of D&A studies have never been attempted before for wave climate, to the best of our knowledge, owing to lack of dataset for different forcing scenario.

CMIP6 forcings.
We derive surface winds (u and v components) at 3 hourly interval and sea ice area fraction at monthly interval from CMIP6 dataset (https://esgf-node.llnl.gov/projects/cmip6/) for historical (natural plus all anthropogenic, called ALL), greenhouse-only (GHG), natural-only (NAT), and aerosol-only forcing (AER) experiments. In CMIP6 dataset, there are many models which can provide surface winds at daily and monthly frequency for each of the above four experiments, but not at the 3 hourly frequency required to accurately reproduce synoptic-scale ocean wave generation. Therefore, we selected the MRI-ESM2.0, the Earth System Model (Version 2.0) by the Meteorological Research Institute 15 , as it provides 3 hourly surface wind (1.125° × 1.11209°~1.12148°), and monthly sea-ice area fraction for each of these four CMIP6 experiments: ALL, GHG, AER, NAT.
MRI-ESM2.0 is a global climate model consisting of four major component models: an atmospheric general circulation model (AGCM) with land processes, an ocean-sea-ice general circulation model, an aerosol model, and an atmospheric chemistry model. This climate model has been shown to display realistic reproduction of both mean climate and interannual variability 15 . Following a preindustrial spin-up, a preindustrial control experiment was conducted with MRI-ESM2.0 as well as a set of historical simulations from the 1850 to the present, which were driven by forcing based on observations. These experiments are conformed to the protocol specified by CMIP6.
The ALL simulations generally end in 2014, but simulations are extended to 2020 with the Shared Economic Pathway (SSP) 2-4.5 scenario. SSP2-4.5 16 is intermediate level of greenhouse gas emissions and its future aerosol and land use changes represent a broad range of SSP-based integrated assessment model projections 11 . Following DAMIP project, CMIP6 historical + SSP2-4.5 experiment is called "historical" here and covers the period 1960-2020. Five variants (r1i1p1f1, r2i1p1f1, r3i1p1f1, r4i1p1f1, r5i1p1f1) are available for the whole 1960-2020 period for each of the four experiments (Table 1 and Fig. 1). The different variant label defines the realization number (r), the initialization index (i), the physics index (p), and the forcing index (f); all being positive integer. For a given CMIP6 experiment (for example: Hist-GHG), the "ripf " identifier is used to uniquely identify each simulation of an ensemble of runs contributed by a single model (for more details, visit https:// github.com/WCRP-CMIP/CMIP6_CVs). The realization index distinguishes among members of an ensemble of simulations that differ only in their initial conditions (e.g. initialized from different points in a control run). The initialization index is used to distinguish simulations performed under the same conditions but with different initialization procedures. The physics index indicates the physics version used by the model, and the forcing index is used to distinguish different variants of forcing applied. Since the control simulation (CTL) is considered as time invariant, available 200 years of MRI-ESM2.0 CTL data is divided into 3 sub-periods of 60 years which is the length of historical simulations.
Wave modelling. WAVEWATCH III version 7.14 (The WAVEWATCH-III Development Group, 2019) was implemented over a near-global grid (78° S to 80° N) with a spatial resolution of 1.0° and using the GEBCO ocean bathymetry. The wave spectrum is discretized in 24 directions (15° directional resolution), and 36 exponentially spaced frequencies from 0.034 to 0.95 Hz, with a 1.1 increment factor from one frequency to the next. The maximum global time step, the maximum Courant-Friedrichs-Lewy (CFL) time steps for advection (x-y) and refraction (k-theta), and the minimum source term time step are set as 1800s, 300 s, 300 s, and 10 s, respectively. The wind input and wave dissipation source terms are from ST4 parameterization described in Ardhuin et al. 17 , with adjustments described in Alday et al. 18 . Non-linear wave interactions are modelled using the Discrete Interaction Approximation (DIA) 19 .
Here, we adjusted the β max parameter, which is a non-dimensional wind-wave growth coefficient that has been used as a tuning parameter to calibrate spectral wave models at global scale for wind strength biases 20  www.nature.com/scientificdata www.nature.com/scientificdata/ (SWH) obtained with the historical-ALL simulations and SWH from the inter-calibrated and denoised multi-mission satellite altimeter dataset (L4 product, version 1.1 21 ) provided by the European Space Agency Climate Change Initiative (ESA-CCI) 22 (Fig. 1). For model calibration, we looked at the systematic error statistics (bias and normalized bias: Nbias) computed over the full year of 2011 (same as Alday et al. 18 ) using β max values increasing from 0.9 to 1.75 ( Table 2). The calibration analysis is based on a single year (enough to capture seasonal variability) in order to deal with computational demand. In this calibration analysis, the observation is 1 Hz along-track altimeter data and 6-hourly model outputs are interpolated at the position and time of altimeter measurements. First, the error metrics are computed for a single member from ALL simulations. We start with β max = 1.75, as used by the reference study 18 , which displays high positive bias (not shown) and Nbias (Fig. 2d). After reducing β max around 1, lower Nbias values are found. The area-averaged (60° S to 60° N) statistical values presented in Table 2 show the lowest bias (−0.04 m) and Nbias (−1.82%), when β max = 1. These global average values are calculated after removing the outliers, from the spatial map, which are defined as values more than three scaled median absolute deviations from the median.
Following β max = 1.0, all the members are shown to produce low bias (not shown here). Additionally, the standard deviations (of bias and Nbias) among the members are quite low. Here we acknowledge that the CMIP6 wind-forcing resolution is not adequate to reproduce the precise timing of storms 23 . Hence the random error metrics (scatter index: SI) using 6-hourly outputs are usually large (around 40%), even though the model captures the long-term climate well. After the β max = 1.0 chosen with this calibration, the model setting is retained for the entire simulation. Thereafter, the model simulations are validated through Nbias, SI, SC (spatial correlation), NSD (normalized standard deviation) and CRMSD (centered root mean squared difference) metrics using monthly mean values for the entire reference data duration. The statistical parameters, used in this study, are defined as follows:  www.nature.com/scientificdata www.nature.com/scientificdata/ where M and R are significant wave heights from model simulation and reference data, respectively. And σ is standard deviation.

Data Records
The full global wave climate dataset 24 can be accessed at https://doi.org/10.12770/0983962b-4acc-4f8f-9484-e2195029b87b. The dataset consists of 6 hourly wave parameters (81 variables including significant wave height, mean wave periods, wave mean direction for total wave energy and partitions, among others), which are detailed in Supplementary Table 1, for 5 variants for each of the four CMIP6 historical experiments (ALL, NAT, GHG, AER) and three sets for CTL experiments (total = 23 runs) covering 60 years (1961-2020).
The 6-hourly outputs are saved into monthly files, so we have 12 × 60 × 23 netcdf files. The hierarchy for folders is set as: GLOB1D_ exp_id/ 0r/year/. The gridded outputs are in the folder /FIELD_NC; and the filenames are defined as: LOPS_WW3-glob-1d_YearMonth.nc,where Exp_id is the name of CMIP6 experiments, ALL, NAT, GHG, AER, CTLand 0r is the realization number of the variants (for example: 01 for r1i1p1f1, 02 for r2i1p1f1, and so on).
LOPS_WW3-glob-1d is WW3 global hindcast produced at LOPS for global ocean at 1° resolution. The folders other than /FIELD_NC are: "data/" folder which has all the configuration files including bottom topography, namelist, etc. "work/" folder which has preprocessing file descrption used for each run. "output/" folder having output field description and others. "SPEC_NC/" folder having spectral energy information at the points listed in "/data/points_glob-1d.txt".

technical Validation
In order to use this dataset for future wave climate research, we evaluate the skill of our model simulations (ensemble mean of ALL) using various metrics with reference to ESA-CCI altimeter 22 and ERA5 25 dataset. The skill of the wave database to capture the mean state is assessed through Nbias and SI maps from monthly mean data for annual and seasonal scales. At annual scale, the Nbias values are typically within 10% (Fig. 3), and www.nature.com/scientificdata www.nature.com/scientificdata/ SI values are typically within 15%, with values higher than 15% mostly located in high latitude, coastal and enclosed seas. Overall, it represents good agreement with altimeter observations with global averaged values −3% and 17% for Nbias and SI, respectively. The Nbias values are significantly lower than the ones obtained for the calibration of the model (shown on Fig. 2), as we used monthly mean and ensemble mean of 5 members. In DJF and JJA seasons, the Nbias and SI values are of similar ranges, but with regional variations (Fig. 3). Exception is the Arabian Sea, which shows overestimation in DJF season and underestimation in JJA season by WW3 simulation compared to altimeter observation. It seems that MRI-ESM2.0 has difficulties in resolving the South Asian Monsoon accurately. In addition, few localized abnormalously high bias values are scattered over tropical Indian Ocean which might occur due to unresolved sub-grid scale physics by our WW3 set-up with 1.0° horizontal resolution. Apparently, the current model discretization is coarse to capture island (Hawaii, Seychelles, Maldives, Chagos) blocking effect. It is noteworthy to mention that altimeter SWH measurements at low sea states (<0.75 m) is usually less accurate and noisier 26 , which might also be the reason for high SI values in coastal areas. We see high SI values over the regions with strong current, specially western boundary currents. In this context, Alday et al. 18 reported reduction in scatter index for the regions with stronger current (such as the Gulf stream, the Kuroshio, the Agulhas, etc.) when current is added as forcing to wave model. Unfortunately, MRI-ESM2.0 does not provide data for ocean currents. Note that, although MRI-ESM2.0 has improved over the previous versions and demonstrates realistic aspects of many climate components, still it has drawbacks in terms of double intertropical convergence zone (ITCZ) bias, cold biases in the Southern Hemisphere (SH) surface air temperature, among others 15 . The inconsistency between model and observation can arise due to combination of multiple factors, including forcing field, parameterization, model discretization, and altimeter measurement errors. Here, model discretizations are a balance between the computational cost and the level of accuracy. The 60 years of hindcast at 6-hourly interval for 5 CMIP6 experiments, each having 5 variants, already demands a large computation storage.
A slightly stronger Nbias and reduced SI values are seen when the model simulations are compared with ERA5 reanalysis (Fig. 4). Interestingly, spatial structure (positive and negative errors) and seasonal variation of Nbias pattern are similar for both reference data set (ESA-CCI altimeter and ERA5), which is partly due to the assimilation of altimeter data in ERA5. More specifically, estimates of Nbias computed with ERA5 show higher overestimation over the mid-latitudes and lesser underestimation over the intertropical regions compared to Nbias computed with altimeter, in agreement with the known differences between altimeter and ERA5 22 . The lower scattered index estimates can be explained by smoother ERA5 values compared to altimeter observations. The global averaged values are 2% and 12% for Nbias and SI, respectively.
It is important to acknowledge that differences in altimeter data processing methods may induce significant differences between multi-mission altimeter products 27,28 . Moreover, different global wave hindcasts and/ or wave reanalysis often show disparate results regarding climatology, variability; as evidenced by Morim et al. 29 and Erikson et al. 30 . As a consequence, the calibration and validation results may potentially vary depending on the selected reference dataset (observations or model). Since it is not the objective of this paper to compare different altimeter and reanalysis products, the reader could to refer to Morim et al. 29 and Erikson et al. 30 for comparative assessments of reference altimeter and model datasets.
As expected, higher error is seen for monthly maxima of SWH than for monthly mean values when compared with altimeter (Figure not shown). Model overestimates almost everywhere with larger values over the mid-latitudes of both hemisphere, and global averaged values are around 20%. The spatial pattern of SI map bears resemblance to that of monthly mean values (Fig. 3), but with amplified magnitude. As acknowledged in Young and Ribal 31 and Timmermans et al. 27 , differences in extremes between products are more likely to be www.nature.com/scientificdata www.nature.com/scientificdata/ higher than in mean values. These large errors can be partially explained by the low revisit time period of altimeter missions, causing significant undersampling errors in multi-mission altimeter products 32 , which poorly resolve the extreme storm events.
Given the noisiness of maximum wave heights, monthly 99th percentile SWHs (H s 99 ) are also validated with reference to ERA-5 wave height (Fig. 5). The spatial pattern of Nbias for H s 99 shows resemblance to that of monthly mean values with higher overestimation in the storm track regions (high latitudes). Notable negative biases prevail over the Arabian Sea and east China sea during summer monsoon season. Contrasting relative difference for Arabian Sea between the two seasons is visible just in the case of monthly mean values. In case of wave period (T z ), the Nbias values are within 10(%), except for east tropical Atlantic. However, limited overestimation is evident along the tropical latitudes of both hemispheres, in the so-called swell pools 33 . The mean wave direction from model shows positive (clockwise) biases along the tropical band, and slightly lower negative  www.nature.com/scientificdata www.nature.com/scientificdata/ (anticlockwise) biases over the high-latitudes. The clockwise rotation can be explained by strong westerly component along the strom belt and the anticlockwise rotation by strong southerly component associated with strong swells from Southern Ocean; and similar pattern is reported by Lemos et al. 34 when comparing with ERA5 dataset. Overall, our model simulations can reproduce the major spatial patterns and seasonal variations as in ERA5. The global average values for normalized bias for H s 99 , T z and bias for MWD (using ERA5 as reference) are 3.07(%), 2.2(%), and 0.9 (°), respectively (Table 3). In addition to Nbias and SI, metrics such as SC, NSD, and CRMSD are also provided; as in COWCLIP framework 23 12 .
To assess multi-annual variability, comparison of linear trend in annual mean wave height is presented in Fig. 6. Generally, the percentage change per year among three datasets agree on spatial distribution and sign of trends. Quantitatively, the range of the trends are also in agreements, although the altimeter trends are slightly stronger in certain regions (e.g. tropical Atlantic). Strong negative trends in the North Pacific is clearly consistent among the datasets. Other regions of consistent trends include Atlantic and eastern Pacific sector of Southern Ocean and tropical Atlantic. Nevertheless, certain differences in trends among three dataset are evident over Arabian Sea and mid-latitudes of south Pacific.
Furthermore, we validate the ability of CMIP6-forced wave model to represent seasonal cycle. The modeled seasonality (difference between January and July wave height) is strongest in Northern Hemisphere (NH) extra-tropical storm belt with lower values over tropics (Fig. 7a). In SH, it is strongest over Indian Ocean sector of Southern Ocean, which is in agreement with Hemer and Trenham 35 . Looking at the bias in seasonality (Fig. 7b), it is encouraging that difference of seasonal amplitude between CMIP6 forced simulation and altimeter is generally lower than 0.5 m, except for the Arabian Sea and certain sectors of the Southern Ocean, which coincides with regions with strong bias (Fig. 3). Overall, the reduced agreement in Southern Ocean can be because of combination of factors, such as presence of circumpolar current, presence of both sea-ice and icebergs, the parameterizations of wave-ice interactions. In a quantitative view, difference in seasonal amplitude is reasonable with a low global mean value (−0.12 m). The temporal correlation map (using monthly mean values) reveal statistically significant relation almost everywhere in global ocean, with higher values (>0.7) in NH (Fig. 7c). Surprisingly, low to no correlation is seen over narrow tropical band and over a narrow latitudinal band close to Southern Ocean. These regions having low correlation coincides with regions having low seasonal or intra-annual variability (Fig. 7a). The spatial patterns of simulated dataset correlated well with the spatial pattern of altimeter observation over the years (blue curve on Fig. 7d), always higher than 0.75 except for the year 1992, probably because of altimeter undersampling issue 32 . To further evaluate spatial variability, difference in spatial standard deviation between WW3 simulation and altimeter is presented (red curve on Fig. 7d). The spatial variance of simulated dataset are generally within the same range as the observation, to be more specific, model variance is slightly stronger than observation since the year 2000. To summarise, this WW3 simulations produce considerably realistic representation of the historical global wave climate.
To present a brief overview of our simulated wave data, 60-year mean climatology for all the CMIP6 experiments: ALL, GHG, AER, and NAT are displayed in Fig. 8. Note that for GHG, AER and NAT scenarios, anomaly  Table 3. Global averaged error metrics for CMIP6_WW3 ALL simulations (5-member mean) for reference data period (ESA-CCI :1992-2018 and ERA5: 1980-2020). Note that for MWD, the absolute bias value is provided (in°), instead of the normalized bias (%).